A Paradigm-Based Finite State Morphological Analyzer for Marathi
Authors
Abstract
A morphological analyzer forms the foundation for many NLP applications for Indian languages. In this paper, we propose and evaluate a morphological analyzer for Marathi, an inflectional language. The analyzer exploits the efficiency and flexibility of finite state machines in modeling the morphotactics, and uses a well-devised system of paradigms to handle stem alternations by exploiting the regularity in inflectional forms. We plug the morphological analyzer into a statistical POS tagger and chunker to measure its impact on their performance and so confirm its usability as a foundation for NLP applications.
1 Motivation and Problem Definition
A highly inflectional language can generate hundreds of words from a single root. Hence, morphological analysis is vital if high-level applications are to understand the various words in the language. A morphological analyzer forms the foundation for applications such as information retrieval, POS tagging, chunking, and ultimately machine translation.
Morphological analyzers for various languages have been studied and developed for years. However, this research is dominated by analyzers for agglutinative languages or for languages like English that show a low degree of inflection. Though agglutinative languages show a high morpheme-per-word ratio and have complex morphotactic structures, the absence of fusion at morpheme boundaries makes segmentation straightforward once the model for implementing the morphotactics is ready. Against this background, a morphological analyzer for a highly inflectional language like Marathi, which tends to overlay morphemes in a way that complicates segmentation, presents an interesting case study.
Eryiğit and Adalı (2004) propose a suffix stripping approach for Turkish. The rule-based and agglutinative nature of Turkish allows the language to be modeled using FSMs and does not need a lexicon.
The Turkish analyzer does not face the problem of changes taking place at morpheme boundaries, which is not the case for inflectional languages. Hence, although easy to understand, this model is not sufficient for handling the morphology of Marathi.
Many morphological analyzers have been developed using the two-level morphological model (Koskenniemi, 1983). The analyzers of Oflazer (1993) and Kim et al. (1994) were developed using PC-Kimmo (Antworth, 1991), a morphological parser based on the two-level model. Conceptually, the model segments a word into its constituent parts and accounts for phonological and orthographical changes within the word. While the model proves very useful for developing morphological analyzers for agglutinative languages or languages with a very low degree of inflection, it fails to explicitly capture the regularities within and between paradigms that are present in inflectional languages. Marathi has a well-defined paradigm-based system of inflection. Hence, we decided to develop our own model, which works along lines similar to PC-Kimmo (Antworth, 1991) but exploits the
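To make the paradigm idea in the text above concrete, the following is a minimal Python sketch of how a paradigm can license a stem alternation before a suffix is accepted. It is not the authors' implementation: it replaces the finite state machinery with a plain table lookup, and the lexicon entry, the paradigm name `masc_aa`, the Latin transliteration, and the suffix labels are all illustrative assumptions.

```python
from typing import List, NamedTuple, Tuple


class Paradigm(NamedTuple):
    direct_ending: str   # ending of the dictionary (direct) form
    oblique_ending: str  # ending the stem takes before a case suffix


# Hypothetical toy data in Latin transliteration (assumption, not from the paper).
PARADIGMS = {"masc_aa": Paradigm(direct_ending="a", oblique_ending="ya")}
LEXICON = {"ghoda": "masc_aa"}  # root -> paradigm name
SUFFIXES = {"la": "dative", "ne": "instrumental", "": "direct"}


def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return (root, paradigm, label) for every analysis the tables license."""
    analyses = []
    for suffix, label in SUFFIXES.items():
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)]
        for root, pname in LEXICON.items():
            paradigm = PARADIGMS[pname]
            base = root[: len(root) - len(paradigm.direct_ending)]
            if suffix == "":
                # No suffix: only the unchanged dictionary form is accepted.
                if stem == root:
                    analyses.append((root, pname, label))
            elif stem == base + paradigm.oblique_ending:
                # A case suffix attaches only to the oblique stem of the paradigm.
                analyses.append((root, pname, label))
    return analyses


print(analyze("ghodyala"))
print(analyze("ghodala"))
```

In this toy setup `ghodyala` is analyzed as the root `ghoda` plus a dative suffix, while `ghodala` is rejected because the suffix is attached to the unaltered stem. One paradigm entry covers every root that shares the same alternation, which is the regularity a paradigm system is meant to capture.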
Similar resources
Morphological Analyzer for Affix Stacking Languages: A Case Study of Marathi
In this paper we describe and evaluate a Finite State Machine (FSM) based Morphological Analyzer (MA) for Marathi, a highly inflectional language with agglutinative suffixes. Marathi belongs to the Indo-European family and is considerably influenced by Dravidian languages. Adroit handling of participial constructions and other derived forms (Krudantas and Taddhitas) in addition to inflected for...
Automated Paradigm Selection for FSA based Konkani Verb Morphological Analyzer
A Morphological Analyzer is a crucial tool for any language. In popular tools used to build morphological analyzers like XFST, HFST and Apertium’s lttoolbox, the finite state approach is used to sequence input characters. We have used the finite state approach to sequence morphemes instead of characters. In this paper we present the architecture and implementation details of a Corpus assisted F...
Finite-State Back-Transliteration for Marathi
In this paper, we describe the creation of an open-source, finite-state based system for back-transliteration of Latin text in the Indian language Marathi. We outline the advantages of our system and compare it to other existing systems, evaluate its recall, and evaluate the coverage of an open-source morphological analyser on our back-transliterated corpus.
Finite-State Morphological Analysis for Marathi
This paper describes the development of free/open-source morphological descriptions for Marathi, an Indo-Aryan language spoken in the state of Maharashtra in India. We describe the conversion and usage of an existing Latin-based lexicon for our Devanagari-based analyser, taking into account the distinction between full vowels and diacritics, that is not adequately captured by the Latin. Marathi...
Finite-State Morphological Analysis Of Persian
This paper describes a two-level morphological analyzer for Persian using a system based on the Xerox finite state tools. Persian language presents certain challenges to computational analysis: There is a complex verbal conjugation paradigm which includes long-distance morphological dependencies; phonological alternations apply at morpheme boundaries; word and noun phrase boundaries are difficu...
Journal title:
Volume, issue:
Pages: -
Publication date: 2010